Claims 

What is claimed is: 

1. A method for identifying groups of pages of common interest from a
collection of hyper-linked pages, comprising the steps of:

identifying a plurality of community cores from the collection, each core being
first and second sets of pages, each page in the first set pointing to every page in
the second set; and

expanding each identified core into a full community, the full community
being a subset of the pages regarding a particular topic.
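
For illustration only, the core condition recited in claim 1 (every page in the first set points to every page in the second set) might be checked as in the following minimal Python sketch. The adjacency mapping `links` and the function name `is_core` are assumptions introduced here; they do not appear in the claims, and the expanding step is not sketched.

```python
def is_core(links, first_set, second_set):
    """Return True if every page in first_set links to every page in second_set.

    links is a hypothetical mapping: page -> set of pages it points to.
    """
    return all(second_set <= links.get(page, set()) for page in first_set)


# Toy data (hypothetical page names).
links = {
    "fan1": {"c1", "c2", "other"},
    "fan2": {"c1", "c2"},
    "fan3": {"c1"},
}
core_ok = is_core(links, {"fan1", "fan2"}, {"c1", "c2"})   # both fans cover both centers
core_bad = is_core(links, {"fan1", "fan3"}, {"c1", "c2"})  # fan3 does not point to c2
```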

2. The method as recited in claim 1, wherein:

the collection includes a plurality of sites, each site having one or more
hyper-linked pages; and

the method further includes the step of removing the hyper-links between 
any two pages on a same site. 
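
The link-removal step of claim 2 could be sketched as follows. This is only an assumption-laden illustration: it treats the lowercase host name of a URL as the "site", which the claims do not specify, and the names `site_of` and `remove_same_site_links` are hypothetical.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Assumed notion of 'site': the lowercase host name of the URL."""
    return (urlsplit(url).hostname or "").lower()


def remove_same_site_links(links):
    """Drop every hyper-link whose source and target are on the same site."""
    return {
        src: {dst for dst in dsts if site_of(dst) != site_of(src)}
        for src, dsts in links.items()
    }


# Toy data: one intra-site link and one cross-site link.
links = {
    "http://a.example/index.html": {
        "http://a.example/about.html",
        "http://b.example/",
    },
}
pruned = remove_same_site_links(links)
```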



3. The method as recited in claim 2 further comprising the step of 
discarding the pages of predetermined sites. 



4. The method as recited in claim 1 further comprising the steps of:

finding highly similar pages that have different names; 

replacing the highly similar pages with a single representative page; and

redirecting any hyper-links that pointed to one of the highly similar pages so
that the redirected hyper-links now point to the representative page.
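
The replacing and redirecting steps of claim 4 might look like the sketch below. How highly similar pages are found is not shown; the mapping `representative` (duplicate page to chosen representative) is assumed to be given, and both it and `collapse_duplicates` are hypothetical names.

```python
def collapse_duplicates(links, representative):
    """Redirect links aimed at near-duplicate pages to their representative.

    representative maps each duplicate page to the page chosen to stand for
    it; detecting the duplicates is outside this sketch.
    """
    def rep(page):
        return representative.get(page, page)

    merged = {}
    for src, dsts in links.items():
        # Out-links of a duplicate are folded into its representative.
        merged.setdefault(rep(src), set()).update(rep(d) for d in dsts)
    return merged


# Toy data: m2 is assumed to be a near-duplicate of m1.
links = {"p": {"m1", "m2"}, "m2": {"q"}}
merged = collapse_duplicates(links, {"m2": "m1"})
```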

5. The method as recited in claim 1 further comprising the steps of:

discarding unnecessary pages from consideration to generate a set of
candidate fan pages and a set of candidate center pages; and

using the set of candidate fan pages and set of candidate center pages as 
the first and second sets, respectively, to identify the community cores. 

6. The method as recited in claim 5, wherein the step of discarding
includes the steps of:

determining candidate fan pages, the candidate fan pages being those
pointing to at least a predetermined number of different sites;

determining candidate center pages, the candidate center pages being those
pointed to by one or more candidate fan pages; and

discarding all pages in the collection except the candidate fan pages and
candidate center pages.
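
A rough sketch of the candidate selection in claim 6 is given below. It again assumes that a "site" is a URL host name, and the threshold parameter `min_sites` stands in for the "predetermined number"; neither choice comes from the claims.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Assumed notion of 'site': the lowercase host name of the URL."""
    return (urlsplit(url).hostname or "").lower()


def select_candidates(links, min_sites):
    """Return (fans, centers, kept_links) after discarding all other pages."""
    # Candidate fans: pages pointing to at least min_sites different sites.
    fans = {
        src for src, dsts in links.items()
        if len({site_of(d) for d in dsts}) >= min_sites
    }
    # Candidate centers: pages pointed to by one or more candidate fans.
    centers = set().union(*(links[f] for f in fans))
    kept_links = {f: set(links[f]) for f in fans}
    return fans, centers, kept_links


# Toy data: f1 points to three sites, f2 to a single site.
links = {
    "http://f1.example/": {
        "http://x.example/", "http://y.example/", "http://z.example/",
    },
    "http://f2.example/": {"http://x.example/", "http://x.example/page2"},
}
fans, centers, kept = select_candidates(links, min_sites=2)
```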

7. The method as recited in claim 6, wherein the determination of
candidate fan pages is based on page content and the hyper-links pointing
therefrom.

8. The method as recited in claim 5, wherein the step of identifying a
plurality of community cores includes the step of finding a plurality of (i, j)-cores
where i and j are the numbers of candidate fan pages and candidate center pages,
respectively, that appear in each identified community core.

9. The method as recited in claim 8, wherein the step of finding a
plurality of (i, j)-cores includes the steps of:

(a) discarding all candidate center pages that have fewer than i hyper-links
pointing thereto;

(b) determining all candidate center pages that have i hyper-links pointing
thereto and determining whether the i hyper-links represent a valid community core;
and

(c) if the i hyper-links represent a valid community core, then outputting the
valid core, otherwise, discarding the determined candidate center pages.
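
One possible reading of steps (a)-(c) is sketched below as a single pass over the candidate center pages. Two points are assumptions of this sketch rather than statements of the claim: a center with exactly i in-links is treated as yielding a valid core when its i fans share at least j centers, and such a center is dropped after it has been examined whether or not a core was output. The name `center_pass` is hypothetical.

```python
def center_pass(out_links, i, j):
    """One pass of steps (a)-(c) over the candidate center pages.

    out_links maps each candidate fan to the set of centers it points to.
    Returns (cores, pruned_out_links), where cores is a set of
    (fans, centers) pairs of frozensets.
    """
    pruned = {fan: set(cs) for fan, cs in out_links.items()}
    in_links = {}
    for fan, centers in out_links.items():
        for c in centers:
            in_links.setdefault(c, set()).add(fan)

    cores = set()
    for center, fans in in_links.items():
        if len(fans) > i:
            continue  # undecided in this pass; keep the center
        if len(fans) == i:
            # (b) the i fans form a core only if they share >= j centers
            common = set.intersection(*(out_links[f] for f in fans))
            if len(common) >= j:
                cores.add((frozenset(fans), frozenset(common)))  # (c)
        # (a) too few in-links, or the center has now been resolved
        for f in fans:
            pruned[f].discard(center)
    return cores, pruned


# Toy data: y and z each have exactly two in-links (from a and b).
out_links = {"a": {"x", "y", "z"}, "b": {"x", "y", "z"}, "c": {"w", "x"}}
cores, pruned = center_pass(out_links, i=2, j=2)
```

Steps (d)-(f) of claim 10 would be the mirror image of this pass, exchanging the roles of fans and centers and of i and j.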

10. The method as recited in claim 9 further comprising the steps of:

(d) discarding all candidate fan pages that have fewer than j hyper-links
pointing therefrom;

(e) determining all candidate fan pages that have j hyper-links pointing
therefrom and determining whether the j hyper-links represent a valid community
core; and

(f) if the j hyper-links represent a valid community core, then outputting the
valid core, otherwise, discarding the determined candidate fan pages.

11. The method as recited in claim 10 further comprising the step of
repeating steps (a)-(f) until every candidate fan page has more than j hyper-links
pointing therefrom and every candidate center page has more than i hyper-links
pointing thereto.

12. The method as recited in claim 10 further comprising the step of 
repeating steps (a)-(f) until a predetermined ending condition is satisfied. 
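
The repetition recited in claims 11 and 12 can be pictured with the simplified loop below. As a deliberate simplification, it applies only the degree-based discards of steps (a) and (d) and omits the validity checks of steps (b), (c), (e) and (f); the assumed ending condition is that a full round discards nothing, so the loop stops when every fan has at least j out-links and every center at least i in-links. The name `prune_to_fixed_point` is hypothetical.

```python
def prune_to_fixed_point(out_links, i, j):
    """Repeat the discards of steps (a) and (d) until nothing changes.

    out_links maps each candidate fan to the set of centers it points to.
    """
    out = {fan: set(cs) for fan, cs in out_links.items()}
    while True:
        # (d) discard fans with fewer than j out-links
        out = {fan: cs for fan, cs in out.items() if len(cs) >= j}
        in_degree = {}
        for cs in out.values():
            for c in cs:
                in_degree[c] = in_degree.get(c, 0) + 1
        # (a) discard centers with fewer than i in-links
        weak = {c for c, d in in_degree.items() if d < i}
        if not weak:
            return out  # assumed ending condition: a round removed no center
        for cs in out.values():
            cs -= weak


# Toy data: removing w leaves c with a single out-link, so c is dropped next.
out_links = {"a": {"x", "y"}, "b": {"x", "y"}, "c": {"y", "w"}}
remaining = prune_to_fixed_point(out_links, i=2, j=2)
```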

13. The method as recited in claim 10 further comprising the steps of:
determining all (2,j) cores by examining all pairs of candidate fan pages;
for i = 3 to n, where n is a predetermined value:

(i) finding all (i,j)-cores by examining the (i-1,j)-cores; and

(ii) for each (i-1, j)-core, determining whether any of the candidate fan
pages may be added to the (i-1, j)-core to yield a (i,j)-core; and

removing all (i,j)-cores that appear as subsets of (i',j) cores, where i' > i. 
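
The level-by-level enumeration of claim 13 resembles the following sketch. It is an illustration under stated assumptions: each core is recorded with all centers common to its fans (at least j of them) rather than exactly j, subset removal is done on the fan sets, and `enumerate_cores` is a hypothetical name.

```python
from itertools import combinations


def enumerate_cores(out_links, j, n):
    """Grow cores one fan at a time, from pairs up to n fans.

    out_links maps each candidate fan to the set of centers it points to.
    Returns a dict: frozenset of fans -> set of centers they all point to.
    """
    fans = sorted(out_links)
    # (2, j)-cores: examine all pairs of candidate fans.
    level = {}
    for a, b in combinations(fans, 2):
        common = out_links[a] & out_links[b]
        if len(common) >= j:
            level[frozenset((a, b))] = common
    found = dict(level)

    for _ in range(3, n + 1):
        nxt = {}
        # Try to add one more fan to each core of the previous level.
        for fan_set, common in level.items():
            for f in fans:
                if f in fan_set:
                    continue
                shared = common & out_links[f]
                if len(shared) >= j:
                    nxt[fan_set | {f}] = shared
        if not nxt:
            break
        found.update(nxt)
        level = nxt

    # Remove cores whose fan set is a proper subset of a larger core's.
    return {
        fs: cs for fs, cs in found.items()
        if not any(fs < other for other in found)
    }


# Toy data: a, b and c all point to x and y.
out_links = {
    "a": {"x", "y", "z"},
    "b": {"x", "y"},
    "c": {"x", "y", "w"},
    "d": {"w"},
}
result = enumerate_cores(out_links, j=2, n=3)
```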

14. A computer program product for use with a computer system for
identifying groups of pages of common interest from a collection of hyper-linked
pages, the computer program product comprising:

a computer-readable medium;

means, provided on the computer-readable medium, for directing the system
to identify a plurality of community cores from the collection, each core being first
and second sets of pages, each page in the first set pointing to every page in the
second set; and

means, provided on the computer-readable medium, for directing the system
to expand each identified core into a full community, the full community being a 
subset of the pages regarding a particular topic. 



15. The computer program product as recited in claim 14, wherein:
the collection includes a plurality of sites, each site having one or more
hyper-linked pages; and

the product further includes means, provided on the computer-readable
medium, for directing the system to remove the hyper-links between any two pages
on a same site.




16. The computer program product as recited in claim 15 further
comprising means, provided on the computer-readable medium, for directing the
system to discard the pages of predetermined sites.



17. The computer program product as recited in claim 14 further
comprising:

means, provided on the computer-readable medium, for directing the system
to find highly similar pages that have different names;

means, provided on the computer-readable medium, for directing the system
to replace the highly similar pages with a single representative page; and



AM9-99-0203 






means, provided on the computer-readable medium, for directing the system 
to redirect any hyper-links that pointed to one of the highly similar pages so that the 
redirected hyper-links now point to the representative page. 



18. The computer program product as recited in claim 14 further
comprising:

means, provided on the computer-readable medium, for directing the system
to discard unnecessary pages from consideration to generate a set of candidate fan
pages and a set of candidate center pages; and

means, provided on the computer-readable medium, for directing the system
to use the set of candidate fan pages and set of candidate center pages as the first
and second sets, respectively, to identify the community cores.

19. The computer program product as recited in claim 18, wherein the 
means for directing to discard includes: 

means, provided on the computer-readable medium, for directing the system
to determine candidate fan pages, the candidate fan pages being those pointing to
at least a predetermined number of different sites;

means, provided on the computer-readable medium, for directing the system
to determine candidate center pages, the candidate center pages being those
pointed to by one or more candidate fan pages; and

means, provided on the computer-readable medium, for directing the system
to discard all pages in the collection except the candidate fan pages and candidate
center pages.

20. The computer program product as recited in claim 19, wherein the
determination of candidate fan pages is based on page content and the hyper-links
pointing therefrom.



21. The computer program product as recited in claim 18, the means for
directing to identify a plurality of community cores includes means, provided on the
computer-readable medium, for directing the system to find a plurality of (i, j)-cores
where i and j are the numbers of candidate fan pages and candidate center pages,
respectively, that appear in each identified community core.



22. The computer program product as recited in claim 21, wherein the
means for directing to find a plurality of (i, j)-cores includes:

(a) means, provided on the computer-readable medium, for directing the
system to discard all candidate center pages that have fewer than i hyper-links
pointing thereto;

(b) means, provided on the computer-readable medium, for directing the
system to determine all candidate center pages that have i hyper-links pointing
thereto and determining whether the i hyper-links represent a valid community core;
and

(c) means, provided on the computer-readable medium, for directing the
system to output the valid core if the i hyper-links represent a valid community core,
otherwise, to discard the determined candidate center pages.

23. The computer program product as recited in claim 22 further
comprising:

(d) means, provided on the computer-readable medium, for directing the
system to discard all candidate fan pages that have fewer than j hyper-links pointing
therefrom;

(e) means, provided on the computer-readable medium, for directing the
system to determine all candidate fan pages that have j hyper-links pointing
therefrom and determining whether the j hyper-links represent a valid community
core; and

(f) means, provided on the computer-readable medium, for directing the
system to output the valid core if the j hyper-links represent a valid community core,
otherwise, discard the determined candidate fan pages.

24. The computer program product as recited in claim 23, wherein the
operation of means (a)-(f) is repeated until every candidate fan page has more than
j hyper-links pointing therefrom and every candidate center page has more than i
hyper-links pointing thereto.

25. The computer program product as recited in claim 23, wherein the
operation of means (a)-(f) is repeated until a predetermined ending condition is
satisfied.

26. The computer program product as recited in claim 23 further
comprising:

means, provided on the computer-readable medium, for directing the system
to determine all (2,j) cores by examining all pairs of candidate fan pages;

for i = 3 to n, where n is a predetermined value:

(i) means, provided on the computer-readable medium, for directing
the system to find all (i,j)-cores by examining the (i-1,j)-cores; and

(ii) for each (i-1, j)-core, means, provided on the computer-readable
medium, for directing the system to determine whether any of the candidate fan
pages may be added to the (i-1, j)-core to yield a (i,j)-core; and

means, provided on the computer-readable medium, for directing the system
to remove all (i,j)-cores that appear as subsets of (i',j) cores, where i' > i.

27. A system for identifying groups of pages of common interest from a
collection of hyper-linked pages, comprising:

means for identifying a plurality of community cores from the collection, each 
core being first and second sets of pages, each page in the first set pointing to 
every page in the second set; and 

means for expanding each identified core into a full community, the full 
community being a subset of the pages regarding a particular topic. 



28. The system as recited in claim 27, wherein:






the collection includes a plurality of sites, each site having one or more
hyper-linked pages; and

the method further includes the step of removing the hyper-links between
any two pages on a same site.

29. The system as recited in claim 28 further comprising means for
discarding the pages of predetermined sites.

30. The system as recited in claim 27 further comprising:
means for finding highly similar pages that have different names;
means for replacing the highly similar pages with a single representative
page; and

means for redirecting any hyper-links that pointed to one of the highly similar 
pages so that the redirected hyper-links now point to the representative page. 

31. The system as recited in claim 27 further comprising:

means for discarding unnecessary pages from consideration to generate a
set of candidate fan pages and a set of candidate center pages; and

means for using the set of candidate fan pages and set of candidate center
pages as the first and second sets, respectively, to identify the community cores.

32. The system as recited in claim 31, wherein the means for discarding
includes:

means for determining candidate fan pages, the candidate fan pages being
those pointing to at least a predetermined number of different sites;

means for determining candidate center pages, the candidate center pages
being those pointed to by one or more candidate fan pages; and



means for discarding all pages in the collection except the candidate fan 
pages and candidate center pages. 

33. The system as recited in claim 32, wherein the determination of
candidate fan pages is based on page content and the hyper-links pointing
therefrom.

34. The system as recited in claim 31, the means for identifying a plurality
of community cores includes means for finding a plurality of (i, j)-cores where i and j
are the numbers of candidate fan pages and candidate center pages, respectively,
that appear in each identified community core.

35. The system as recited in claim 34, wherein the means for finding a
plurality of (i, j)-cores includes:

(a) means for discarding all candidate center pages that have fewer than i
hyper-links pointing thereto;

(b) means for determining all candidate center pages that have i hyper-links
pointing thereto and determining whether the i hyper-links represent a valid
community core; and

(c) means for outputting the valid core if the i hyper-links represent a valid 
community core, otherwise, discarding the determined candidate center pages. 



36. The system as recited in claim 35 further comprising: 

(d) means for discarding all candidate fan pages that have fewer than j
hyper-links pointing therefrom;

(e) means for determining all candidate fan pages that have j hyper-links
pointing therefrom and determining whether the j hyper-links represent a valid
community core; and

(f) means for outputting the valid core if the j hyper-links represent a valid 
community core, otherwise, discarding the determined candidate fan pages. 



37. The system as recited in claim 36, wherein the operation of means
(a)-(f) is repeated until every candidate fan page has more than j hyper-links
pointing therefrom and every candidate center page has more than i hyper-links
pointing thereto.

38. The system as recited in claim 36, wherein the operation of means
(a)-(f) is repeated until a predetermined ending condition is satisfied.

39. The system as recited in claim 36 further comprising:
means for determining all (2,j) cores by examining all pairs of candidate fan
pages;

for i = 3 to n, where n is a predetermined value: 

(i) means for finding all (i,j)-cores by examining the (i-1,j)-cores; and

(ii) for each (i-1, j)-core, means for determining whether any of the
candidate fan pages may be added to the (i-1, j)-core to yield a (i,j)-core; and

means for removing all (i,j)-cores that appear as subsets of (i',j) cores, where